Classifying Vehicles based on Geometric Features extracted from the Silhouettes

Impact of Dimensionality Reduction and Principal Component Analysis on Support Vector Machines Classifier

Data Description:

The data contains features extracted from vehicle silhouettes viewed at different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.

Domain: Object recognition

Context:

The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

Attribute Information:

Learning Outcomes:

Objective:

Apply a dimensionality reduction technique (PCA) and train a model on the principal components instead of training it on the raw data alone.

Steps and tasks:

  1. Data pre-processing – perform all the preprocessing necessary for the data to be fed to an unsupervised algorithm (10 marks)
  2. Understanding the attributes – find relationships between the different attributes (independent variables) and choose carefully which attributes should be part of the analysis, and why (10 points)
  3. Split the data into train and test sets (suggestion: specify "random state" if you are using train_test_split from Sklearn) (5 marks)
  4. Train a Support Vector Machine on the train set and get the accuracy on the test set (10 marks)
  5. Perform K-fold cross validation and get the cross validation score of the model (optional)
  6. Use PCA from Scikit-learn to extract the Principal Components that capture about 95% of the variance in the data (10 points)
  7. Repeat steps 3, 4 and 5, but this time use the Principal Components instead of the original data. The accuracy score should be computed on the same test rows as before (hint: set the same random state) (20 marks)
  8. Compare the accuracy and cross validation scores of the two Support Vector Machines – one trained on the raw data, the other on the Principal Components – and report your findings (5 points)

Step 1: Import the necessary Libraries

Step 2: Load the dataset

Checking the shape of the dataset

Checking the datatypes and null records

Checking total number of Null values in the dataset

Checking duplicate entries in the dataset

Checking if the dataset has only Numeric Data

Step 3: Statistical Summary (Five Number Summary) of the Dataset
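These inspection steps can be sketched with the usual pandas calls; since the dataset file itself is not shown here, a small synthetic frame (with an injected missing value, as seen in the real data) stands in for it:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the vehicle silhouette data (column names assumed).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "compactness": rng.normal(93, 8, 10),
    "circularity": rng.normal(44, 6, 10),
    "class": ["car", "bus", "van", "car", "car",
              "bus", "van", "car", "bus", "car"],
})
df.loc[2, "compactness"] = np.nan    # inject one missing value for illustration

print(df.shape)                      # rows x columns
print(df.dtypes)                     # datatype per column
print(df.isnull().sum())             # null count per column
print(df.duplicated().sum())         # number of duplicate rows
print(df.describe())                 # five-number summary plus mean/std
```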

Observations from Initial Analysis:

  1. compactness, max.length_aspect_ratio, max.length_rectangularity, hollows_ratio, class has missing values. All other features doesn't have any missing data. Need to check on the Missing Data Imputation for these features.
  2. All the predictor features are of numeric datatype. Target column is non-numeric.
  3. compactness, circularity and skewness_about.2 seems to be Normally distributed based on the Mean and Median values. However, need to verify the same using Distribution plots in Univariate Analysis.
  4. All other features seems to exhibit skewness. Need to check the Outliers for these features.
  5. scaled_variance.1 and skewness_about.1 seems to have high variability based on their Mean and Standard Deviation values.

Step 4: Exploratory Data Analysis (Univariate & Multivariate)

Univariate Analysis

Shapiro Test for Normality check
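A minimal sketch of the Shapiro-Wilk check using scipy, run here on synthetic columns (the names are placeholders for the real features):

```python
import numpy as np
from scipy import stats

# One roughly normal and one clearly skewed stand-in column.
rng = np.random.default_rng(0)
samples = {
    "compactness": rng.normal(0, 1, 200),           # roughly normal
    "scaled_variance.1": rng.lognormal(0, 1, 200),  # clearly skewed
}
for name, values in samples.items():
    stat, p = stats.shapiro(values)
    verdict = "looks normal" if p > 0.05 else "not normal"
    print(f"{name}: W={stat:.3f}, p={p:.4f} -> {verdict}")
```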

Based on the Shapiro-Wilk test, none of the variables can be considered normally distributed.

Skewness Check

Observations:

Handling Skewness with Log Transform
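A sketch of the transform on synthetic lognormal data standing in for a skewed feature; np.log1p is one common choice, though the notebook's exact transform is not shown:

```python
import numpy as np
from scipy.stats import skew

# Lognormal data stands in for a positively skewed column
# such as scaled_variance.1.
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

skew_before = skew(x)
skew_after = skew(np.log1p(x))   # log(1 + x) stays valid at zero
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```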

Imputing Missing Values

Instantiating KNN Imputer to impute the NaNs
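A minimal KNNImputer sketch on a toy matrix (n_neighbors=2 here for illustration; the notebook's setting may differ). Each NaN is replaced by the average of that feature over the k nearest rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with one missing entry in the second column.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)   # NaN replaced by the mean of its 2 nearest rows' values
```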

Outliers Detection & Treatment

Visualising the Target variable class

We can see that the three classes are imbalanced: 'car' is the majority class with 50% of the records, while 'bus' accounts for 26% and 'van' for 24%.

Multivariate Analysis

Observations from KDE Plots:

KDE plots of skewness_about, skewness_about.1 and skewness_about.2 show that these are very weak predictors of the target variable. We will examine the Pairplot and Correlation Matrix before deciding on the Feature Selection.

Pairplot of all variables in the dataset

Let us Label Encode the class target variable
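A sketch of the encoding; LabelEncoder assigns integers in alphabetical order of the class names:

```python
from sklearn.preprocessing import LabelEncoder

# Map the string class labels to integer codes.
labels = ["car", "bus", "van", "car", "bus"]
le = LabelEncoder()
encoded = le.fit_transform(labels)

print(list(le.classes_))   # alphabetical: ['bus', 'car', 'van']
print(list(encoded))       # [1, 0, 2, 1, 0]
```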

Correlation Matrix

Highly correlated features with target variable class

Highly correlated Independent variables

Detecting Multicollinear features using Variance Inflation Factor (VIF)
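VIF is often computed with statsmodels' variance_inflation_factor; the sketch below computes the same quantity, VIF_j = 1 / (1 - R²_j), with plain numpy on synthetic data where one feature is nearly collinear with another:

```python
import numpy as np

# Synthetic features: x3 is almost a linear function of x1.
rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 2 * x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    # R^2 from regressing feature j on all the others (with intercept).
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

for j in range(X.shape[1]):
    print(f"feature {j}: VIF = {vif(X, j):.1f}")
```

A common rule of thumb flags VIF values above 5 or 10 as evidence of multicollinearity.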

Observations from the Multivariate Analysis & Correlation Matrix:

Unsupervised Learning Methods for EDA

Cophenetic Correlation Coefficient
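A sketch of the linkage comparison using scipy, on synthetic two-cluster data standing in for the vehicle features:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Two well-separated synthetic clusters.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(5, 1, (30, 4))])
dists = pdist(X)

# Cophenetic correlation: how faithfully each linkage's dendrogram
# preserves the original pairwise distances.
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    c, _ = cophenet(Z, dists)
    print(f"{method:>8}: cophenetic corr = {c:.3f}")
```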

Average linkage appears to be the best linkage choice for this dataset, as it yields the highest Cophenetic Correlation coefficient (0.70).

Dendrograms to understand Feature Dependence

The 3-cluster dendrogram appears to give an output closest to the true labels provided in the dataset.

Cluster Map with Metric = 'euclidean'

Feature Selection Methods

Step 5: Split the data into Train and Test set

Separating the Predictors and Target variables in X & y

Split the Train and Test set

View the shapes of Train and Test sets
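A sketch of the split on stand-in data; the fixed random_state is what lets the PCA-based model later be scored on the same test rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and the 3 encoded classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 3, size=100)

# random_state fixes the split; stratify keeps class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)   # (70, 5) (30, 5)
```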

Step 6: Data Preprocessing

Scaling & Centering the Data

Fit using Train data and Transform both Train and Test data
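A sketch of the scaling step on synthetic data; fitting on the train set only avoids leaking test statistics into training:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic unscaled features.
rng = np.random.default_rng(5)
X_train = rng.normal(50, 10, size=(80, 3))
X_test = rng.normal(50, 10, size=(20, 3))

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # fit on train only
X_test_s = scaler.transform(X_test)         # reuse the train mean/std

print(X_train_s.mean(axis=0).round(2))      # ~0 per column
print(X_train_s.std(axis=0).round(2))       # ~1 per column
```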

Step 7: Model Building - Training Classifier on Original Dataset

Support Vector Machines
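A minimal SVC sketch on synthetic data (the real notebook trains on the scaled silhouette features; the hyperparameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the vehicle data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

clf = SVC(kernel="rbf", C=1.0)   # RBF is the scikit-learn default kernel
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(f"test accuracy: {acc:.3f}")
```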

Running SVM with Feature Selection subset

Hyperparameter Tuning for Support Vector Machines Classifier
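A GridSearchCV sketch; the parameter grid below is an illustrative assumption, not the notebook's actual grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data for the tuning step.
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=1)

param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01],
    "kernel": ["rbf", "linear"],
}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

print("best params:", grid.best_params_)
print(f"best CV accuracy: {grid.best_score_:.3f}")
```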

Model Summary on Original Dataset with all dimensions

Step 8: Dimensionality Reduction using PCA

PCA using Numpy Linear Algebra

Deriving the Covariance Matrix

Eigen Decomposition of Covariance Matrix
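The decomposition steps above can be sketched with plain numpy on synthetic correlated data; np.linalg.eigh is appropriate because the covariance matrix is symmetric:

```python
import numpy as np

# Synthetic correlated features stand in for the standardized dataset.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]             # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()           # explained variance ratios
print("explained variance ratio:", explained.round(3))
print("cumulative:", np.cumsum(explained).round(3))
```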

Scree Plot for Explained Variance vs Principal Components

Observations from the Principal Component Analysis:

PCA using Scikit-Learn
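A sketch of the scikit-learn route; passing a float to n_components keeps just enough components to reach that fraction of explained variance (0.95 per the assignment). The data below is a synthetic stand-in:

```python
import numpy as np
from sklearn.decomposition import PCA

# 12 correlated synthetic features with underlying rank 6.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 12))

pca = PCA(n_components=0.95)          # keep components up to 95% variance
X_reduced = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("variance captured:", pca.explained_variance_ratio_.sum().round(4))
```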

Based on the Scree Plot, let us consider 6 dimensions for further analysis (reduced from original 12 dimensions)

Observations:

Step 9: Model Building on Dataset with Reduced Dimensions

Support Vector Machines

PCA with 2 components:

PCA with 6 components:

PCA with 8 components:

Hyperparameter Tuning for Support Vector Machines Classifier

Observations:

Step 10: Model Comparison - All-Dimensions vs Reduced-Dimensions

Cross Validation Score Comparison: All-Dimensions vs Reduced
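A sketch of the comparison on synthetic data; wrapping scaling and PCA in a pipeline keeps both inside each CV fold, so no fold sees statistics from the others:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the vehicle data.
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_classes=3, random_state=1)

svc_raw = make_pipeline(StandardScaler(), SVC())
svc_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC())

scores_raw = cross_val_score(svc_raw, X, y, cv=5)
scores_pca = cross_val_score(svc_pca, X, y, cv=5)
print(f"raw features: {scores_raw.mean():.3f} +/- {scores_raw.std():.3f}")
print(f"principal components: {scores_pca.mean():.3f} +/- {scores_pca.std():.3f}")
```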

Hypothesis Testing validation to compare All-Dimensions vs Reduced

Plotting the Learning Curve for both the models

We can see that the Learning Curve for the Principal Components (reduced) model is smoother and reaches the threshold with a training size of about 200, versus 400 for the original features.

Classifier Metrics Comparison for Original vs Reduced Data Models

The feature-selected SVC shows performance values similar to those of the SVC trained on 8 Principal Components.

Learnings & Summary

Statistical Summary and Initial EDA:

Univariate Analysis:

Multivariate Analysis:

Unsupervised Learning Methods for EDA:

Feature Engineering & Selection:

Model Building - Original Features:

Model Building - with Reduced Dimensions and PCA:

Model Comparison:

--------------------------------------End of Assignment-------------------------------------------